Daily Weather Dataset

In this article, we analyze a weather dataset from Kaggle.com.

Data description from Kaggle:

Loading the Dataset

Problem Description

Let's set Relative Humidity (Afternoon) as the target variable. That is, given the remaining features in the dataset, we would like to predict how humid it is at 3 PM. To do so, we define a Humidity Level (Afternoon) feature as follows:

$$\text{Humidity Level (Afternoon)} = \begin{cases} 0 &\mbox{Very Low} \\ 1 &\mbox{Low} \\ 2 &\mbox{Medium} \\ 3 &\mbox{High} \end{cases}$$
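One way to build such a feature is to bin the continuous humidity readings with pandas. The sketch below is only illustrative: the column name `relative_humidity_3pm`, the sample values, and the cut points at 25, 50 and 75 percent are assumptions, since the thresholds are not stated here.

```python
import pandas as pd

# Hypothetical afternoon humidity readings in percent (not taken from the dataset)
df = pd.DataFrame(
    {"relative_humidity_3pm": [8.0, 22.5, 31.0, 47.9, 55.2, 68.4, 90.1]}
)

# Assumed cut points; the article does not state the thresholds it uses
bins = [0, 25, 50, 75, 100]
labels = [0, 1, 2, 3]  # 0: Very Low, 1: Low, 2: Medium, 3: High

# Map each reading to its level; include_lowest keeps a reading of exactly 0
df["humidity_level_3pm"] = pd.cut(
    df["relative_humidity_3pm"], bins=bins, labels=labels, include_lowest=True
).astype(int)
```

Quantile-based bins (`pd.qcut`) would be an alternative if roughly balanced classes are preferred over fixed thresholds.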

We can visualize the data using Parallel Coordinates.

However, this visualization becomes easier to read if the observations are first grouped by a clustering method. For this reason, we use the K-Means clustering method.
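A minimal sketch of the clustering step with scikit-learn follows. The synthetic data, the choice of two clusters, and the scaling step are assumptions made for illustration, not the settings used in this analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the weather features: two well separated groups
X = np.vstack(
    [
        rng.normal(0.0, 1.0, size=(50, 3)),
        rng.normal(10.0, 1.0, size=(50, 3)),
    ]
)

# K-Means is distance based, so features are standardized first
X_scaled = StandardScaler().fit_transform(X)

# n_clusters=2 is an assumption chosen to match the synthetic data
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X_scaled)
```

The resulting `cluster_labels` can be stored as a column and passed as the class column to `pandas.plotting.parallel_coordinates`, so that each line in the plot is colored by its cluster.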

Training and testing sets

StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.
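The stratification property can be checked on a small example. The toy arrays below are assumptions for illustration: 20 samples with a 60/40 class split, divided into four folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 12 samples of class 0 and 8 samples of class 1
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 12 + [1] * 8)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# Count how many samples of each class land in every test fold
test_class_counts = []
for train_idx, test_idx in skf.split(X, y):
    test_class_counts.append(np.bincount(y[test_idx], minlength=2).tolist())
```

Each test fold holds three samples of class 0 and two of class 1, which preserves the 60/40 ratio of the full set.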

Modeling: Random Forest Classifier

A random forest classifier (RFC) fits several decision tree classifiers on various sub-samples of the dataset and then averages their predictions to improve predictive accuracy and control over-fitting. See sklearn.ensemble.RandomForestClassifier for more details.
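As a rough illustration of the API, the sketch below fits a forest with default parameters. The data come from `make_classification` with four classes as a stand-in for the four humidity levels; it is not the weather dataset, and the split sizes are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic four-class problem standing in for the humidity levels
X, y = make_classification(
    n_samples=400,
    n_features=8,
    n_informative=4,
    n_classes=4,
    n_clusters_per_class=1,
    random_state=0,
)

# Stratified hold-out split so every class appears in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Default parameters; only the seed is fixed for reproducibility
rfc = RandomForestClassifier(random_state=0)
rfc.fit(X_train, y_train)

# Class probabilities are the average over the individual trees
probabilities = rfc.predict_proba(X_test)
accuracy = rfc.score(X_test, y_test)
```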

Random Forest Classifier with Default Parameters

Random Forest Classifier with the Best Parameters and Feature Ranking
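One common way to search for good parameters and then rank features is a cross-validated grid search followed by reading `feature_importances_` from the best estimator. The sketch below assumes that approach; the parameter grid, the synthetic data, and the use of `GridSearchCV` are illustrative choices, not necessarily the procedure used here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic data standing in for the weather features
X, y = make_classification(
    n_samples=200,
    n_features=6,
    n_informative=3,
    n_redundant=0,
    n_classes=2,
    random_state=0,
)

# Small hypothetical grid; a real search would likely cover more values
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X, y)

# The best estimator is refit on all data by default (refit=True)
best_rfc = search.best_estimator_

# Feature indices ordered from most to least important
ranking = np.argsort(best_rfc.feature_importances_)[::-1]
```

Impurity-based importances sum to one and tend to favor high-cardinality features, so `sklearn.inspection.permutation_importance` is worth considering as a cross-check.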


References

  1. Kaggle Dataset: Daily Weather Dataset
  2. scikit-learn Random Forest Classifier
  3. Random Forest Classifier Wikipedia page
  4. Mower, Jeffrey P. "PREP-Mt: predictive RNA editor for plant mitochondrial genes." BMC bioinformatics 6.1 (2005): 1-15.